Fig 1: Proportion of variance explained. LDpred-inf with summary statistics from UKBB, BBJ, meta-AFR, and meta-ALL. Points are empirical estimates, and error bars are bootstrap standard errors (n=5,000).

Fig 2: Proportion of variance explained. PRS calculated using linear combinations of PRS weights from LDpred-inf (PRS1, PRS2) and using local ancestry PRS (PRS3). HRS_AFR

Fig 3: Proportion of variance explained. PRS calculated using linear combinations of PRS weights from LDpred-inf (PRS1, PRS2) and using local ancestry PRS (PRS3). PMBB_AFR

## [[1]]
##           IID        PRS Summary_Stats
##    1: HG00096 -0.8527006           BBJ
##    2: HG00097  0.1212889           BBJ
##    3: HG00099 -0.5494812           BBJ
##    4: HG00100 -0.5839695           BBJ
##    5: HG00101 -0.4089522           BBJ
##   ---                                 
## 2544: NA21137 -0.8566551           BBJ
## 2545: NA21141 -1.7621960           BBJ
## 2546: NA21142 -2.0326330           BBJ
## 2547: NA21143 -0.6662671           BBJ
## 2548: NA21144 -1.6150280           BBJ
## 
## [[2]]
##           IID      PRS Summary_Stats
##    1: HG00096 27.05733         GIANT
##    2: HG00097 27.41120         GIANT
##    3: HG00099 27.12162         GIANT
##    4: HG00100 26.67773         GIANT
##    5: HG00101 26.31729         GIANT
##   ---                               
## 2544: NA21137 18.69068         GIANT
## 2545: NA21141 18.08475         GIANT
## 2546: NA21142 18.15647         GIANT
## 2547: NA21143 18.39023         GIANT
## 2548: NA21144 18.03324         GIANT
## 
## [[3]]
##           IID       PRS Summary_Stats
##    1: HG00096 0.2384030      META_AFR
##    2: HG00097 0.3688721      META_AFR
##    3: HG00099 0.5712746      META_AFR
##    4: HG00100 0.4977523      META_AFR
##    5: HG00101 0.5995154      META_AFR
##   ---                                
## 2544: NA21137 0.3496227      META_AFR
## 2545: NA21141 0.2533668      META_AFR
## 2546: NA21142 0.2311690      META_AFR
## 2547: NA21143 0.2501532      META_AFR
## 2548: NA21144 0.2052742      META_AFR
## 
## [[4]]
##           IID        PRS Summary_Stats
##    1: HG00096 0.05689507     META_AFR2
##    2: HG00097 0.08622545     META_AFR2
##    3: HG00099 0.09906398     META_AFR2
##    4: HG00100 0.11491700     META_AFR2
##    5: HG00101 0.12450070     META_AFR2
##   ---                                 
## 2544: NA21137 0.06857799     META_AFR2
## 2545: NA21141 0.03008647     META_AFR2
## 2546: NA21142 0.02576827     META_AFR2
## 2547: NA21143 0.04221819     META_AFR2
## 2548: NA21144 0.03798021     META_AFR2
## 
## [[5]]
##           IID       PRS Summary_Stats
##    1: HG00096 0.2550199      META_ALL
##    2: HG00097 0.4226962      META_ALL
##    3: HG00099 0.4593673      META_ALL
##    4: HG00100 0.5827615      META_ALL
##    5: HG00101 0.4737461      META_ALL
##   ---                                
## 2544: NA21137 0.4528811      META_ALL
## 2545: NA21141 0.2801209      META_ALL
## 2546: NA21142 0.1636204      META_ALL
## 2547: NA21143 0.3484188      META_ALL
## 2548: NA21144 0.4033137      META_ALL
## 
## [[6]]
##           IID       PRS Summary_Stats
##    1: HG00096 0.7188846     META_ALL2
##    2: HG00097 0.8236255     META_ALL2
##    3: HG00099 0.8149656     META_ALL2
##    4: HG00100 0.9149501     META_ALL2
##    5: HG00101 0.8167783     META_ALL2
##   ---                                
## 2544: NA21137 0.6600387     META_ALL2
## 2545: NA21141 0.4850991     META_ALL2
## 2546: NA21142 0.4344598     META_ALL2
## 2547: NA21143 0.5892258     META_ALL2
## 2548: NA21144 0.5828353     META_ALL2
## 
## [[7]]
##           IID      PRS Summary_Stats
##    1: HG00096 2.319422      META_EUR
##    2: HG00097 2.580297      META_EUR
##    3: HG00099 2.526723      META_EUR
##    4: HG00100 2.636959      META_EUR
##    5: HG00101 2.565912      META_EUR
##   ---                               
## 2544: NA21137 1.625801      META_EUR
## 2545: NA21141 1.298389      META_EUR
## 2546: NA21142 1.381026      META_EUR
## 2547: NA21143 1.416706      META_EUR
## 2548: NA21144 1.502949      META_EUR
## 
## [[8]]
##           IID          PRS Summary_Stats
##    1: HG00096  0.082338810      META_NEA
##    2: HG00097  0.165794200      META_NEA
##    3: HG00099  0.007156949      META_NEA
##    4: HG00100  0.050496770      META_NEA
##    5: HG00101  0.111681700      META_NEA
##   ---                                   
## 2544: NA21137  0.104231000      META_NEA
## 2545: NA21141  0.007553810      META_NEA
## 2546: NA21142 -0.096658380      META_NEA
## 2547: NA21143  0.020338990      META_NEA
## 2548: NA21144  0.014733670      META_NEA
## 
## [[9]]
##           IID         PRS Summary_Stats
##    1: HG00096 -0.02684305          PAGE
##    2: HG00097  0.12634790          PAGE
##    3: HG00099  0.17277890          PAGE
##    4: HG00100  0.07942150          PAGE
##    5: HG00101  0.11524650          PAGE
##   ---                                  
## 2544: NA21137 -0.11703410          PAGE
## 2545: NA21141 -0.15028210          PAGE
## 2546: NA21142 -0.33831460          PAGE
## 2547: NA21143  0.04048262          PAGE
## 2548: NA21144 -0.24686130          PAGE
## 
## [[10]]
##           IID           PRS Summary_Stats
##    1: HG00096 -0.5076630000          UKBB
##    2: HG00097 -0.3305627000          UKBB
##    3: HG00099 -0.1356251000          UKBB
##    4: HG00100  0.1636529000          UKBB
##    5: HG00101 -0.0002598876          UKBB
##   ---                                    
## 2544: NA21137  0.1901339000          UKBB
## 2545: NA21141 -0.3319316000          UKBB
## 2546: NA21142 -0.2989896000          UKBB
## 2547: NA21143  0.0275398500          UKBB
## 2548: NA21144  0.1834473000          UKBB

Fig 4: 1000 GENOMES

Background

In our previous work, we analysed the factors that drive reduced prediction accuracy of polygenic scores for height in individuals with African ancestry.

We saw that differences in the site frequency spectrum (SFS) and linkage disequilibrium (LD) play a role, but there is also suggestive evidence of differences in marginal effect sizes.

In that study we ran a GWAS in ~8,000 individuals with African ancestry from the UKBB and tested for differences between the resulting marginal effect sizes and European-derived effect sizes, as well as for correlations of those differences with allele frequency differences. Finally, we implemented ancestry-informed PRSs in the admixed individuals, and observed only a very modest improvement in prediction accuracy.

It is possible that the modest improvement was due to our small sample size. So here we use a much larger sample (about 58K African ancestry individuals and 91K total) to explore the potential of ancestry-informed PRSs for height. We also try a larger meta-analysis, with 58K African ancestry individuals and

We are also interested in whether fine-mapping index variants with African ancestry data included lets us select SNPs that yield better PRS performance.

Questions:

  1. Does the prediction accuracy of multi-PRS and LA-PRS increase when using effect sizes from a meta-African analysis? What about a meta-Pan analysis?

  2. What is the overlap in GWAS hits among the EUR-only, BBJ-only, and AFR-only GWAS, and combinations of those?

  3. When we select ancestry-specific index variants and then use those in the PRS, does prediction improve?

Methods

For now, we are focusing on height only.

GWAS summary statistics

We use GWAS summary statistics for height from six sources:

  • UKBB_eur: UK Biobank Europeans.

  • BBJ: Biobank Japan.

  • UGP (Uganda Genome Project): a meta-analysis of Uganda + 3 other populations from Africa, described in the Uganda Genome Project paper.

  • UKBB_afr: the African subset of the panUKBB dataset.

  • N’diaye et al. 2011: still the largest height GWAS performed in African ancestry individuals.

  • PAGE: a large meta-analysis in which 35% of participants are African American and the remaining participants are mostly of Hispanic/Latino and other minority ancestries.

So our meta-AFR analysis includes PAGE, N’diaye et al. 2011, UKBB_afr, and the 4 cohorts from UGP.

Our meta-ALL analysis includes meta-AFR, UKBB_eur, and BBJ.

Meta-analysis in METAL

We performed two meta-analyses:

  • meta-AFR: UGP, pan-UKBB (AFR), the N’diaye et al. 2011 meta-analysis, and the PAGE project. Total of 90,970 individuals (58,488 of African ancestry). See Table 1.

  • meta-ALL: UKBB_eur, meta-AFR (previous step), and BBJ (Biobank Japan, N=159,095). Total of 610,453 individuals (58,488 of African ancestry). See Table 1.

Note that both include the same number of African ancestry individuals (N=XX). We performed meta-ALL to check whether a larger sample size and increased diversity in the discovery cohort would improve prediction.

We ran each meta-analysis in METAL, using one input file for each of the above datasets. We set genomic correction to “ON”, meaning that it is applied to each input file (not to the final values). We used SCHEME STDERR, meaning that betas and standard errors are combined. For the meta-AFR analysis, we set AVERAGEFREQ and MINMAXFREQ to “ON” so that METAL tracks large allele frequency differences across datasets, which can suggest allelic mismatch. We only report results for variants with a combined weight of at least 49,781 (meta-AFR) or 590,026 (meta-ALL) individuals, resulting in about 20 million autosomal variants in both datasets.

We inspected the p-value distributions of these meta-analyses using QQ-plots, calculated the genomic inflation factor on the final p-values, and applied corrections accordingly.
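As a rough illustration of what a standard-error-based (inverse-variance) scheme computes for a single variant, the sketch below combines per-study betas and standard errors. It is not METAL itself: the function name `inverse_variance_meta` and the way genomic control is applied here (inflating each study's standard error by the square root of its lambda when lambda > 1) are assumptions made for illustration.

```python
import math


def inverse_variance_meta(betas, ses, lambdas=None):
    """Fixed-effect inverse-variance meta-analysis for one variant.

    betas, ses: per-study effect estimates and standard errors.
    lambdas: optional per-study genomic-control factors. ASSUMPTION:
             each SE is multiplied by sqrt(lambda) when lambda > 1.
    Returns (combined beta, combined SE, z-score).
    """
    if lambdas is None:
        lambdas = [1.0] * len(betas)

    weights = []
    for se, lam in zip(ses, lambdas):
        se_corrected = se * math.sqrt(max(lam, 1.0))
        weights.append(1.0 / se_corrected ** 2)  # weight = 1 / variance

    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    return beta, se, beta / se


# Hypothetical numbers for two studies reporting the same variant.
beta, se, z = inverse_variance_meta([0.10, 0.20], [0.05, 0.05])
```

With two studies of equal standard error, the combined beta is the simple average of the two betas and the combined standard error shrinks by a factor of sqrt(2).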
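The genomic inflation factor is commonly computed as the median of the 1-degree-of-freedom chi-square statistics implied by the p-values, divided by the expected median under the null (about 0.455). A minimal sketch, assuming two-sided p-values strictly between 0 and 1 (or equal to 1); the function name `genomic_inflation` is hypothetical and this is not necessarily the exact procedure used here:

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(pvalues):
    """Return lambda_GC = median(chi2 statistics) / expected null median."""
    nd = NormalDist()
    # A two-sided p-value p corresponds to |z| = -inv_cdf(p / 2),
    # and the 1-df chi-square statistic is z squared.
    chi2 = [nd.inv_cdf(p / 2.0) ** 2 for p in pvalues]
    return median(chi2) / CHI2_1DF_MEDIAN


# Evenly spaced p-values mimic a null (uniform) distribution.
null_like = [(i + 0.5) / 1001 for i in range(1001)]
lam = genomic_inflation(null_like)
```

Values of lambda close to 1 indicate little inflation; values well above 1 suggest inflated test statistics.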

Data QC

Summary statistics QC

Most were in the hg19 build, except N’diaye, which we lifted over from hg18 to hg19. Filtering had already been applied in each of these studies, and there is often not enough information for us to perform our own filtering.

  • UGP: this is very recent. They filtered for imputation score > 0.3.

  • pan-UKBB: They filtered for INFO scores > 0.8 and a minimum allele count of 20 in each population. They also provide a True/False flag, “low_quality_AFR”, which we use, retaining only variants for which it is ‘false’. GWAS covariates: Age, Sex, Age*Sex, Age^2, Age^2*Sex, and the first 10 PCs. Height in cm was inverse-normal transformed.

  • N’diaye et al.: The genomic control (GC) inflation factor was calculated for each study and used for within-study correction prior to the meta-analysis. The overall lambda they report is 1.064 (which we confirm, see table below), suggesting no inflation in this meta-analysis. The imputation info score is not available, but the authors filtered for >= 0.3. Betas and SE are in units of z-score.

  • PAGE: inverse-normal-adjusted residuals for each trait outcome. Info score available. The authors filtered for > 0.4; we were stricter and filtered for > 0.8.

As mentioned, for the meta-analysis summary statistics we only retained positions for which there was information for most individuals in the meta-analysis (20.7 M and 23.7 M SNPs for meta-AFR and meta-ALL, respectively).

For UKBB_eur, we retained only SNPs with INFO > 0.8 (11.9 M SNPs) and low_quality_variant=FALSE (15.4 M SNPs). Only autosomal SNPs were analyzed.

LD reference panels

For PRS using summary statistics from UKBB_eur, we used imputed data from 5,000 randomly sampled UKBB_eur individuals as the LD reference panel. For PRS using the BBJ summary statistics, we used a combination of the 1000G Phase 3 East Asians and UKBB Chinese individuals. For the meta-AFR summary statistics, we used a combination of UKBB_afr and 1000G Phase 3 African ancestry individuals. For the meta-ALL summary statistics, we used a combination of all 1000G Phase 3 individuals, UKBB_eur, UKBB_afr, and UKBB_chi. In all cases, the combined sets were QC’d to include only unrelated individuals (plink --rel-cutoff 0.125) with genotype missingness < 0.85. We further restricted these sets to SNPs with MAF > 0.001, removed SNPs with allelic mismatches relative to the UKBB_EUR summary statistics file, and corrected for strand flips when appropriate.

Test data

For test data, we used the Penn Biobank subsets of European American and African American individuals (Table XX), the HRS subsets of European and African Americans, and the UKBB Chinese individuals (Table XX). Individuals with height more than two standard deviations from the sex- and cohort-specific mean were excluded (Table 3).

Test cohorts

Genotype data from test cohorts was lifted over to hg38 when needed.

  • PMBB (Penn Biobank): with sets of EUR (7,501) and AFR (9,226) ancestry individuals.

  • UKB_CHI (UKBB Chinese): a set of 1,504 individuals with Chinese ancestry from the UK Biobank.

  • HRS (Health and Retirement Study): with sets of EUR (10,486) and AFR (2,322) ancestry individuals.

We visually inspected QQ-plots of height residuals for each dataset to check for extreme outliers. Based on this inspection, we restricted PMBB (Figs 3-4 for before and after filtering) and HRS (Figs 5-6 for before and after filtering) samples to those for which residual height was within \(\pm3\) standard deviations of the mean for each sex. For UKB-CHI, no filtering was necessary (Fig 7). Height residuals were obtained for each individual by regressing height on all covariates and their interactions:

\[height \sim Sex + Age + Age^2 + Sex*Age + Sex*Age^2 + p_{EUR} + Sex*p_{EUR} + Age*p_{EUR} + Age^2*p_{EUR}\]

where \(p_{EUR}\) is the genome-wide average proportion of European ancestry for PMBB_afr and HRS_afr (estimated with RFMix), and the European ancestry component (estimated with unsupervised ADMIXTURE with k=2) for UKB_CHI. For HRS_eur and PMBB_eur, we set \(p_{EUR}\) to 1.
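The per-sex \(\pm3\) standard deviation filter on residual height described above can be sketched as follows. This is an illustration only: it assumes the residuals have already been computed, and the record layout (dictionaries with `sex` and `resid` keys) and the function name are hypothetical.

```python
from statistics import mean, stdev


def filter_outliers_by_sex(records, n_sd=3.0):
    """Keep records whose residual lies within n_sd standard deviations
    of the mean residual for their own sex.

    records: list of dicts with keys 'sex' and 'resid' (hypothetical layout).
    Each sex group needs at least two records to estimate a standard deviation.
    """
    # Group residuals by sex.
    by_sex = {}
    for rec in records:
        by_sex.setdefault(rec["sex"], []).append(rec["resid"])

    # Compute the allowed interval for each sex.
    bounds = {}
    for sex, values in by_sex.items():
        centre, spread = mean(values), stdev(values)
        bounds[sex] = (centre - n_sd * spread, centre + n_sd * spread)

    return [
        rec for rec in records
        if bounds[rec["sex"]][0] <= rec["resid"] <= bounds[rec["sex"]][1]
    ]
```

Because the bounds are computed within each sex, a residual that is extreme for one sex is not masked by the spread of the other.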

When multiple time points were available for each individual, we retained the one corresponding to the latest height measure and age. All height phenotype data was formatted to be in centimeters.

Each test cohort was randomly divided into a “train” and a “test” set at a ratio of 0.15 (train) to 0.85 (test) for most datasets, except for UKB_CHI and HRS_afr, where we used 0.20:0.80 (Table 2). We performed a stratified split of the data using the initial_split function from the rsample R package, with ‘Sex’ as the stratum, i.e., to keep Sex proportions similar between the training and testing sets (Table 2).
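To make the idea of a stratified split concrete, here is a minimal Python sketch of the same principle: shuffle within each stratum and take the same proportion from each. It is an analogue for illustration, not the rsample implementation; the function name, the record layout, and the seed are assumptions.

```python
import random


def stratified_split(records, strata_key, train_prop, seed=0):
    """Split records into (train, test), sampling train_prop of each stratum.

    records: list of dicts; strata_key names the stratifying field (e.g. 'Sex').
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Group records by stratum.
    groups = {}
    for rec in records:
        groups.setdefault(rec[strata_key], []).append(rec)

    train, test = [], []
    for key in sorted(groups):
        members = groups[key][:]  # copy so the input is not reordered
        rng.shuffle(members)
        n_train = round(train_prop * len(members))
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return train, test


# Hypothetical cohort: 600 women and 400 men, split 0.15 / 0.85.
cohort = [{"IID": i, "Sex": "F" if i < 600 else "M"} for i in range(1000)]
train, test = stratified_split(cohort, "Sex", train_prop=0.15)
```

Sampling within each stratum keeps the Sex proportions of the train and test sets close to those of the full cohort, which a simple random split only achieves on average.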

PRS calculations

We used LDpred for PRS calculations. For UKBB_eur summary statistics, we used UKBB_eur as the LD reference panel; for BBJ and meta-AFR we used East Asians and Africans from 1000G Phase 3, respectively. We first ran ldpred coord to coordinate the summary statistics, test, and LD datasets. Next, we ran the Gibbs sampler. Many values of p did not converge, but p=1 and p=0.3 typically did, so we examined those as well as the infinitesimal model. See Table

PRS_eur: PRS using effect sizes (\(\beta\)) from UKBB_eur.

PRS_eas: PRS using effect sizes (\(\beta\)) from BBJ (all East Asian).

PRS_afr: PRS using effect sizes (\(\beta\)) from the meta-AFR GWAS.

PRS_all: PRS using effect sizes (\(\beta\)) from the meta-ALL GWAS.

  1. A simple linear regression model

\[height \sim Sex + Age + Age^2 + p_{EUR}\]

\[height \sim Sex + Age + Age^2 + p_{EUR} + PRS_{eur}\]

  1. PRS1_ML (described in Marquez-Luna et al. 2017 and Bitarello & Mathieson 2020)

  2. PRS2_BD - linear combination of PRS described in Bitarello & Mathieson 2020
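One common way to quantify the variance explained by a PRS from two nested regressions like those above (covariates only versus covariates plus PRS) is the partial R^2, i.e. the gain in R^2 from adding the PRS, scaled by the variance left unexplained by the covariates. The sketch below is a small pure-Python least-squares illustration; whether this is the exact statistic plotted in the figures is an assumption, and both function names are hypothetical.

```python
def ols_r2(X, y):
    """R^2 of an ordinary least-squares fit of y on the columns of X.

    X: list of rows (each a list of predictor values); an intercept is added.
    Assumes the predictors are not perfectly collinear.
    """
    rows = [[1.0] + list(x) for x in X]
    k = len(rows[0])
    # Build the augmented normal equations [X'X | X'y].
    M = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    coef = [M[i][k] / M[i][i] for i in range(k)]

    fitted = [sum(b * v for b, v in zip(coef, r)) for r in rows]
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def partial_r2(covariates, prs, y):
    """Partial R^2 of the PRS given the covariates."""
    r2_base = ols_r2(covariates, y)
    r2_full = ols_r2([list(c) + [p] for c, p in zip(covariates, prs)], y)
    return (r2_full - r2_base) / (1.0 - r2_base)
```

In practice `covariates` would hold Sex, Age, Age^2, and \(p_{EUR}\) for each individual in the test set, and `prs` the corresponding score.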